Semantic Word Clouds with Background Corpus Normalization and t-distributed Stochastic Neighbor Embedding

Authors

  • Erich Schubert
  • Andreas Spitz
  • Michael Weiler
  • Johanna Geiß
  • Michael Gertz
Abstract

Many word clouds provide no semantics in the word placement, but use a random layout optimized solely for aesthetic purposes. We propose a novel approach to model word significance and word affinity within a document, and in comparison to a large background corpus. We demonstrate its usefulness for generating more meaningful word clouds as a visual summary of a given document. We then select keywords based on their significance and construct the word cloud based on the derived affinity. Based on a modified t-distributed stochastic neighbor embedding (t-SNE), we generate a semantic word placement. For words that cooccur significantly, we include edges, and cluster the words according to their cooccurrence. For this we designed a scalable and memory-efficient sketch-based approach usable on commodity hardware to aggregate the corpus statistics needed for normalization, and for identifying keywords as well as significant cooccurrences. We empirically validate our approach using a large Wikipedia corpus.
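The abstract mentions a memory-efficient, sketch-based aggregation of cooccurrence statistics. As an illustrative sketch only (the paper's exact data structure may differ), a count-min sketch can count word-pair cooccurrences within a sliding window in bounded memory, at the cost of one-sided overestimation:

```python
import hashlib

class CountMinSketch:
    """Approximate counter: estimates are upper bounds on true counts."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One independent hash per row, derived via a per-row salt.
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), salt=bytes([row]) * 4).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Minimum over rows; never less than the true count.
        return min(self.table[row][col] for row, col in self._cells(key))

def count_cooccurrences(tokens, sketch, window=2):
    """Record each unordered word pair within the given window."""
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            sketch.add("\t".join(sorted((w, v))))

sketch = CountMinSketch()
tokens = "the word cloud shows the word affinity".split()
count_cooccurrences(tokens, sketch)
estimate = sketch.estimate("word\tcloud" if "cloud" > "word" else "cloud\tword")
# `estimate` is an upper bound on the true cooccurrence count of (word, cloud)
```

With a fixed `width x depth` table, memory stays constant regardless of vocabulary size, which is what makes such sketches usable on commodity hardware for large corpora.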


Similar Papers

Distinguish Polarity in Bag-of-Words Visualization

Neural network-based BOW models reveal that word-embedding vectors encode strong semantic regularities. However, such models are insensitive to word polarity. We show that, coupled with simple information such as word spellings, word-embedding vectors can preserve both semantic regularity and conceptual polarity without supervision. We then describe a nontrivial modification to the t-distributed...


Better Word Embeddings for Korean

Vector representations of words that accurately capture semantic and syntactic information are critical for the performance of models that use these vectors as inputs. Algorithms that only use the surrounding context at the word level ignore subword-level relationships, which carry important meaning, especially for highly inflected languages such as Korean. In this paper we compare th...


Syntactico Semantic Word Representations in Multiple Languages

Our project is an extension of the project “Syntactico Semantic Word Representations in Multiple Languages”[1]. The previous project aims to improve the semantic representation of English vocabulary by incorporating local context with global context and by handling homonymy and polysemy through multiple embeddings per word. It also introduces a new neural network architecture that learns the w...


Text comparison using word vector representations and dimensionality reduction

This paper describes a technique to compare large text sources using word vector representations (word2vec) and dimensionality reduction (tSNE) and how it can be implemented using Python. The technique provides a bird’s-eye view of text sources, e.g. text summaries and their source material, and enables users to explore text sources like a geographical map. Word vector representations capture m...


Temporal Semantic Analysis and Visualization of Words

Today there are many languages spoken in the world, among which English is the most popular. However, English words have evolved considerably throughout history, making it very difficult for contemporary readers to read ancient English texts. There are many kinds of change, such as the mutation of the word itself, the migration of word usage from one context to another, etc. It is thus very interesting to unders...



Journal:
  • CoRR

Volume: abs/1708.03569  Issue: 

Pages: -

Publication date: 2017